258 research outputs found

    TEASER: Early and Accurate Time Series Classification

    Early time series classification (eTSC) is the problem of classifying a time series after as few measurements as possible with the highest possible accuracy. The most critical issue of any eTSC method is to decide when enough data of a time series has been seen to take a decision: waiting for more data points usually makes the classification problem easier but delays the time at which a classification is made; in contrast, earlier classification has to cope with less input data, often leading to inferior accuracy. The state-of-the-art eTSC methods compute a fixed optimal decision time, assuming that every time series has the same defined start time (like turning on a machine). However, in many real-life applications measurements start at arbitrary times (like measuring heartbeats of a patient), implying that the best time for taking a decision varies heavily between time series. We present TEASER, a novel algorithm that models eTSC as a two-tier classification problem: in the first tier, a classifier periodically assesses the incoming time series to compute class probabilities. However, these class probabilities are only used as output label if a second-tier classifier decides that the predicted label is reliable enough, which can happen after a different number of measurements for each time series. In an evaluation using 45 benchmark datasets, TEASER is two to three times earlier at predictions than its competitors while reaching the same or an even higher classification accuracy. We further show TEASER's superior performance using real-life use cases, namely energy monitoring and gait detection.
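
The two-tier idea described in this abstract can be illustrated with a minimal sketch. It is not the TEASER implementation: the first tier here is a stand-in nearest-centroid scorer, the second tier is a stand-in margin threshold, and the names `prefix_probabilities`, `is_reliable`, `classify_early`, `step`, `min_margin` and `consecutive` are all hypothetical. The only point is to show how a second decision step can let each series stop after a different number of measurements.

```python
import math


def prefix_probabilities(prefix, centroids):
    """First tier (stand-in): softmax over negative distances to class centroids.

    zip() truncates each centroid to the length of the observed prefix.
    """
    scores = {}
    for label, centroid in centroids.items():
        dist = math.sqrt(sum((x - c) ** 2 for x, c in zip(prefix, centroid)))
        scores[label] = -dist
    top = max(scores.values())
    exp = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exp.values())
    return {label: e / total for label, e in exp.items()}


def is_reliable(probs, min_margin):
    """Second tier (stand-in): accept when the top-two probability gap is large."""
    ranked = sorted(probs.values(), reverse=True)
    return ranked[0] - ranked[1] >= min_margin


def classify_early(series, centroids, step=2, min_margin=0.5, consecutive=2):
    """Return (label, measurements_used), stopping as soon as the label is trusted."""
    streak_label, streak = None, 0
    for end in range(step, len(series) + 1, step):
        probs = prefix_probabilities(series[:end], centroids)
        label = max(probs, key=probs.get)
        if is_reliable(probs, min_margin):
            # Count how often the same trusted label was seen in a row.
            streak = streak + 1 if label == streak_label else 1
            streak_label = label
            if streak >= consecutive:
                return label, end
        else:
            streak_label, streak = None, 0
    # No early decision was possible: fall back to the full series.
    probs = prefix_probabilities(series, centroids)
    return max(probs, key=probs.get), len(series)


# Toy data, invented for illustration only.
toy_centroids = {
    "up": [0, 1, 2, 3, 4, 5, 6, 7],
    "down": [0, -1, -2, -3, -4, -5, -6, -7],
}
print(classify_early([0, 1, 2, 3, 4, 5, 6, 7], toy_centroids))
print(classify_early([0, 0, 0, 0, 0, 0, 0, 0], toy_centroids))
```

In this toy setup a clearly rising series is labelled after a few measurements, while a flat, ambiguous series is only labelled once all measurements have been seen.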

    Cross talk between Wnt/β-catenin and Irf8 in leukemia progression and drug resistance

    Progression and disease relapse of chronic myeloid leukemia (CML) depends on leukemia-initiating cells (LIC) that resist treatment. Using mouse genetics and a BCR-ABL model of CML, we observed cross talk between Wnt/β-catenin signaling and the interferon-regulatory factor 8 (Irf8). In normal hematopoiesis, activation of β-catenin results in up-regulation of Irf8, which in turn limits oncogenic β-catenin functions. Self-renewal and myeloproliferation become dependent on β-catenin in Irf8-deficient animals that develop a CML-like disease. Combined Irf8 deletion and constitutive β-catenin activation result in progression of CML into fatal blast crisis, elevated leukemic potential of BCR-ABL-induced LICs, and Imatinib resistance. Interestingly, activated β-catenin enhances a preexisting Irf8-deficient gene signature, identifying β-catenin as an amplifier of progression-specific gene regulation in the shift of CML to blast crisis. Collectively, our data uncover Irf8 as a roadblock for β-catenin-driven leukemia and imply both factors as targets in combinatorial therapy.

    Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts

    Biomedical literature curation is the process of automatically and/or manually deriving knowledge from scientific publications and recording it into specialized databases for structured delivery to users. It is a slow, error-prone, complex, costly and yet highly important task. Previous experience has shown that text mining can assist in its many phases, especially in the triage of relevant documents and the extraction of named entities and biological events. Here, we present the curation pipeline of the CellFinder database, a repository of cell research, which includes data derived from literature curation and microarrays to identify cell types, cell lines, organs and so forth, and especially patterns in gene expression. The curation pipeline is based on freely available tools in all text mining steps, as well as the manual validation of extracted data. Preliminary results are presented for a data set of 2376 full texts from which >4500 gene expression events in cells or anatomical parts have been extracted. Validation of half of this data resulted in a precision of ~50% of the extracted data, which indicates that we are on the right track with our pipeline for the proposed task. However, evaluation of the methods shows that there is still room for improvement in the named-entity recognition and that a larger and more robust corpus is needed to achieve a better performance for event extraction. Database URL: http://www.cellfinder.org

    CELDA - an ontology for the comprehensive representation of cells in complex systems

    BACKGROUND: The need for detailed description and modeling of cells drives the continuous generation of large and diverse datasets. Unfortunately, there exists no systematic and comprehensive way to organize these datasets and their information. CELDA (Cell: Expression, Localization, Development, Anatomy) is a novel ontology for the association of primary experimental data and derived knowledge to various types of cells of organisms. RESULTS: CELDA is a structure that can help to categorize cell types based on species, anatomical localization, subcellular structures, developmental stages and origin. It targets cells in vitro as well as in vivo. Instead of developing a novel ontology from scratch, we carefully designed CELDA in such a way that existing ontologies were integrated as much as possible, and only minimal extensions were performed to cover those classes and areas not present in any existing model. Currently, ten existing ontologies and models are linked to CELDA through the top-level ontology BioTop. Together with 15,439 newly created classes, CELDA contains more than 196,000 classes and 233,670 relationship axioms. CELDA is primarily used as a representational framework for modeling, analyzing and comparing cells within and across species in CellFinder, a web based data repository on cells (http://cellfinder.org). CONCLUSIONS: CELDA can semantically link diverse types of information about cell types. It has been integrated within the research platform CellFinder, where it exemplarily relates cell types from liver and kidney during development on the one hand and anatomical locations in humans on the other, integrating information on all spatial and temporal stages. CELDA is available from the CellFinder website: http://cellfinder.org/about/ontology

    CellFinder: a cell data repository

    CellFinder (http://www.cellfinder.org) is a comprehensive one-stop resource for molecular data characterizing mammalian cells in different tissues and in different development stages. It is built from carefully selected data sets stemming from other curated databases and the biomedical literature. To date, CellFinder describes 3394 cell types and 50 951 cell lines. The database currently contains 3055 microscopic and anatomical images, 205 whole-genome expression profiles of 194 cell/tissue types from RNA-seq and microarrays and 553 905 protein expressions for 535 cells/tissues. Text mining of a corpus of >2000 publications followed by manual curation confirmed expression information on ∼900 proteins and genes. CellFinder's data model can seamlessly represent entities from single cells to the organ level, incorporate mappings between homologous entities in different species and describe processes of cell development and differentiation. Its ontological backbone currently consists of 204 741 ontology terms incorporated from 10 different ontologies unified under the novel CELDA ontology. CellFinder's web portal allows searching, browsing and comparing the stored data, interactive construction of developmental trees and navigating the partonomic hierarchy of cells and tissues through a unique body browser designed for life scientists and clinicians.

    PEDL: extracting protein-protein associations using deep language models and distant supervision

    MOTIVATION: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein-protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data which severely limits their performance. RESULTS: We propose PPA Extraction with Deep Language (PEDL), a method for predicting PPAs from text that combines deep language models and distant supervision. Due to the reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods solely relying on manually labelled annotations. We introduce three different datasets for PPA prediction and evaluate PEDL for the two subtasks of predicting PPAs between two proteins, as well as identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better in both tasks on all three datasets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA. AVAILABILITY AND IMPLEMENTATION: PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used datasets and to reproduce the experiments from this article. CONTACT: [email protected] or [email protected]
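
The general idea behind distant supervision, as referred to in this abstract, can be sketched as follows. This is a toy illustration and not PEDL's actual pipeline: the function name `distant_labels`, the tuple layout, and the example sentences are assumptions made up for this sketch. A sentence mentioning two proteins is treated as a (noisy) positive training example when the pair is already listed in a knowledge base, so no manual annotation is needed.

```python
def distant_labels(mentions, known_pairs):
    """Attach noisy labels to sentences using a knowledge base of protein pairs.

    mentions: iterable of (sentence, protein_a, protein_b) tuples.
    known_pairs: set of frozensets, each holding two associated proteins.
    """
    labeled = []
    for sentence, protein_a, protein_b in mentions:
        # frozenset makes the lookup independent of mention order.
        pair = frozenset((protein_a, protein_b))
        labeled.append((sentence, protein_a, protein_b, pair in known_pairs))
    return labeled


# Toy knowledge base and sentences, invented for illustration only.
toy_kb = {frozenset(("MDM2", "TP53"))}
toy_mentions = [
    ("MDM2 binds TP53 and promotes its degradation.", "MDM2", "TP53"),
    ("TP53 and INS were both measured in the samples.", "TP53", "INS"),
]
for row in distant_labels(toy_mentions, toy_kb):
    print(row)
```

Labels obtained this way are noisy, since a sentence can mention a known pair without stating the association; that trade-off is what buys the larger amount of training data.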

    Deregulation of the endogenous C/EBPβ LIP isoform predisposes to tumorigenesis

    Two long and one truncated isoforms (termed LAP*, LAP, and LIP, respectively) of the transcription factor CCAAT enhancer binding protein beta (C/EBPβ) are expressed from a single intronless Cebpb gene by alternative translation initiation. Isoform expression is sensitive to mammalian target of rapamycin (mTOR)-mediated activation of the translation initiation machinery and relayed through an upstream open reading frame (uORF) on the C/EBPβ mRNA. The truncated C/EBPβ LIP, initiated by high mTOR activity, has been implicated in neoplasia, but it was never shown whether endogenous C/EBPβ LIP may function as an oncogene. In this study, we examined spontaneous tumor formation in C/EBPβ knockin mice that constitutively express only the C/EBPβ LIP isoform from its own locus. Our data show that deregulated C/EBPβ LIP predisposes to oncogenesis in many tissues. Gene expression profiling suggests that C/EBPβ LIP supports a pro-tumorigenic microenvironment, resistance to apoptosis, and alteration of cytokine/chemokine expression. The results imply that enhanced translation reinitiation of C/EBPβ LIP promotes tumorigenesis. Accordingly, pharmacological restriction of mTOR function might be a therapeutic option in tumorigenesis that involves enhanced expression of the truncated C/EBPβ LIP isoform. KEY MESSAGE: Elevated C/EBPβ LIP promotes cancer in mice. C/EBPβ LIP is upregulated in B-NHL. Deregulated C/EBPβ LIP alters apoptosis and cytokine/chemokine networks. Deregulated C/EBPβ LIP may support a pro-tumorigenic microenvironment.

    Chemical-protein relation extraction with ensembles of carefully tuned pretrained language models

    The identification of chemical-protein interactions described in the literature is an important task with applications in drug design, precision medicine and biotechnology. Manual extraction of such relationships from the biomedical literature is costly and often prohibitively time-consuming. The BioCreative VII DrugProt shared task provides a benchmark for methods for the automated extraction of chemical-protein relations from scientific text. Here we describe our contribution to the shared task and report on the achieved results. We define the task as a relation classification problem, which we approach with pretrained transformer language models. Upon this basic architecture, we experiment with utilizing textual and embedded side information from knowledge bases as well as additional training data to improve extraction performance. We perform a comprehensive evaluation of the proposed model and the individual extensions including an extensive hyperparameter search leading to 2647 different runs. We find that ensembling and choosing the right pretrained language model are crucial for optimal performance, whereas adding additional data and embedded side information did not improve results. Our best model is based on an ensemble of 10 pretrained transformers and additional textual descriptions of chemicals taken from the Comparative Toxicogenomics Database. The model reaches an F1 score of 79.73% on the hidden DrugProt test set and achieves the first rank out of 107 submitted runs in the official evaluation. Database URL: https://github.com/leonweber/drugprot
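
The abstract does not state how the ensemble members are combined, so the following is only one common option, shown as a minimal sketch: averaging the class probabilities of several models and picking the class with the highest mean. The function name `ensemble_predict`, the label names and the probability values are assumptions for illustration, not the authors' code or data.

```python
def ensemble_predict(prob_rows, labels):
    """Average per-model probability vectors and return (best_label, averages).

    prob_rows: one probability vector per model, all in the order of `labels`.
    """
    n_models = len(prob_rows)
    # zip(*prob_rows) groups the probabilities of each class across models.
    averages = [sum(column) / n_models for column in zip(*prob_rows)]
    best = max(range(len(labels)), key=averages.__getitem__)
    return labels[best], averages


# Illustrative relation labels and made-up model outputs.
toy_labels = ["INHIBITOR", "ACTIVATOR", "NONE"]
toy_outputs = [
    [0.6, 0.3, 0.1],  # model 1
    [0.2, 0.5, 0.3],  # model 2 disagrees with the others
    [0.5, 0.4, 0.1],  # model 3
]
label, averages = ensemble_predict(toy_outputs, toy_labels)
print(label, averages)
```

Averaging smooths out the disagreement of a single model, which is one intuition for why ensembling tends to stabilise predictions.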

    PEDL+: Protein-centered relation extraction from PubMed at your fingertip

    Relation extraction (RE) from large text collections is an important tool for database curation, pathway reconstruction, or functional omics data analysis. In practice, RE is often part of a complex data analysis pipeline requiring specific adaptations like restricting the types of relations or the set of proteins to be considered. However, current systems are either non-programmable web sites or research code with fixed functionality. We present PEDL+, a user-friendly tool for extracting protein-protein and protein-chemical associations from PubMed articles. PEDL+ combines state-of-the-art NLP technology with adaptable ranking and filtering options and can easily be integrated into analysis pipelines. We evaluated PEDL+ in two pathway curation projects and found that 59% to 80% of its extractions were helpful.